ViT-B/16
- North America > Mexico > Puebla (0.04)
- North America > United States > New York (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > California (0.04)
- North America > Canada > Ontario > Toronto (0.14)
- Oceania > Australia > New South Wales (0.04)
- (6 more...)
MSA is constructed on top of Attention by splitting the channels of Q, K, and V into h groups, with each group holding its own queries, keys, and values Q_i, K_i, V_i ∈ R^{N×(d/h)}.
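The channel-splitting step can be sketched in NumPy — a minimal illustration of splitting Q, K, V into h head groups and running scaled dot-product attention per group; the random projection weights are placeholders, not the paper's trained parameters:

```python
import numpy as np

def multi_head_self_attention(x, num_heads):
    """Sketch of MSA: split the channels of Q, K, and V into h groups
    (heads), attend within each group, then concatenate the groups.
    Projection weights here are random, for illustration only."""
    n, d = x.shape                       # N tokens, d channels
    assert d % num_heads == 0
    d_head = d // num_heads
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Split channels into h groups: (N, d) -> (h, N, d/h)
    split = lambda t: t.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    q_i, k_i, v_i = split(q), split(k), split(v)

    scores = q_i @ k_i.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, N, N)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                        # row-wise softmax
    out = attn @ v_i                                           # (h, N, d/h)
    # Concatenate head outputs back to (N, d)
    return out.transpose(1, 0, 2).reshape(n, d)
```

Each head attends over all N tokens but sees only its own d/h-channel slice, which is what "splitting the channels into h groups" buys: h cheaper attention maps instead of one full-width one.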
F̄_{s,n} = F_{s,n} B_{s,n},  n = 1, 2, ..., S,  (10)

where F_s denotes the support features extracted by a pretrained ViT. Inspired by multiple-object tracking within a single framework [21], in which different objects are represented by distinct identifications (i.e., learnable vectors) so they can be tracked simultaneously, we add extra learnable tokens to the mean features to obtain more discriminative prompts.
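A hypothetical sketch of this prompt construction, assuming Eq. (10) masks and averages the ViT support features over S shots before the learnable tokens are appended — function name, shapes, and the random initialisation of the extra tokens are all assumptions, not the authors' code:

```python
import numpy as np

def build_prompts(support_feats, support_masks, num_extra_tokens=2, seed=0):
    """Hypothetical sketch: mask the support features F_{s,n} with
    B_{s,n}, average over the S shots, then append extra learnable
    tokens (random-initialised here) as in the tracking-inspired
    identification vectors. Shapes are assumptions."""
    # support_feats: (S, N, d) ViT features; support_masks: (S, N) binary
    masked = support_feats * support_masks[..., None]          # F_{s,n} B_{s,n}
    mean_feat = masked.sum(axis=(0, 1)) / np.maximum(support_masks.sum(), 1)
    rng = np.random.default_rng(seed)
    extra = rng.standard_normal((num_extra_tokens, mean_feat.shape[0])) * 0.02
    # Prompts = mean support feature plus learnable identification tokens
    return np.vstack([mean_feat[None, :], extra])              # (1 + extra, d)
```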
Structure is Supervision: Multiview Masked Autoencoders for Radiology
Laguna, Sonia, Agostini, Andrea, Ryser, Alain, Ruiperez-Campillo, Samuel, Cannistraci, Irene, Vandenhirtz, Moritz, Mandt, Stephan, Deperrois, Nicolas, Nooralahzadeh, Farhad, Krauthammer, Michael, Sutter, Thomas M., Vogt, Julia E.
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
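The two training signals the abstract describes — masked reconstruction within a view and alignment across views — can be sketched as a pair of losses. This is an illustrative reading of the abstract, not the authors' implementation; all tensor names and the cosine-alignment choice are assumptions:

```python
import numpy as np

def mvmae_losses(view_a, recon_a, mask_a, emb_a, emb_b):
    """Illustrative sketch of MVMAE's two signals: masked patch
    reconstruction on one radiograph view, plus cross-view alignment
    between the two views' pooled embeddings. All tensors are assumed
    to come from an encoder/decoder not shown here."""
    # view_a, recon_a: (N, p) patch pixels; mask_a: (N,) 1 = masked patch
    # Reconstruction MSE, averaged over masked patches only (as in MAE)
    per_patch = ((recon_a - view_a) ** 2).mean(-1)
    recon_loss = (per_patch * mask_a).sum() / np.maximum(mask_a.sum(), 1)
    # Cross-view alignment: negative cosine similarity of the two embeddings
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    align_loss = 1.0 - float(a @ b)
    return recon_loss, align_loss
```

Restricting the reconstruction loss to masked patches follows the standard MAE recipe; the alignment term is what turns the clinical redundancy between projections (e.g., frontal and lateral views of the same study) into a supervisory signal.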
- Europe > Switzerland > Zürich > Zürich (0.05)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- North America > United States (0.04)
- (3 more...)
- Research Report > Experimental Study (0.88)
- Research Report > New Finding (0.68)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)